storage: fsync sideload sst writes every 2MB #20449
Conversation
🎉 Have you tested this out to make sure data gets flushed as intended? Would you like me to?

Reviewed 3 of 3 files at r1.

pkg/storage/replica_proposal.go, line 341 at r1 (raw file):
Nit, but is there any reason for all callers to have to provide

pkg/util/fileutil/syncing_write.go, line 22 at r1 (raw file):
esseially

pkg/util/fileutil/syncing_write.go, line 52 at r1 (raw file):
Shouldn't we close the file even if there was an error? Leaking the file descriptor doesn't seem like a good idea.
LGTM mod @a-robinson's comments. Reviewed 3 of 3 files at r1.
pkg/storage/replica_proposal.go, line 341 at r1 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
Done.

pkg/util/fileutil/syncing_write.go, line 22 at r1 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
Done.

pkg/util/fileutil/syncing_write.go, line 52 at r1 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
Whoops, yes. I meant to close it unconditionally and overwrite err with close's err only when it was nil.
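For reference, a minimal sketch of that deferred-close idiom in Go (illustrative names, not the PR's actual code):

```go
package fileutil

import "os"

// writeAndSync illustrates the fix discussed above: close the file
// unconditionally so the descriptor is never leaked, and let Close's
// error surface only when no earlier error occurred.
func writeAndSync(path string, data []byte) (err error) {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0644)
	if err != nil {
		return err
	}
	defer func() {
		// Always close, even on error; only report Close's error
		// when the write and sync themselves succeeded.
		if cerr := f.Close(); err == nil {
			err = cerr
		}
	}()
	if _, err = f.Write(data); err != nil {
		return err
	}
	return f.Sync()
}
```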
I haven't tested it yet -- was going to pick up your rocks config change too and try out a build with both of them, which, theoretically, should have no giant writes happening during a big restore.
LGTM pending testing. The necessary rocksdb changes are in as of this morning. One thing to check to make sure this is working as intended is that during the restore you don't see the output of

Reviewed 3 of 3 files at r2.
Good news and bad news. 2GB restore, 4 node roachprod gce cluster:

Good news: On master, running

Bad news: that 2GB restore went from 36s to 139s. Yikes.
Ouch, that's pretty brutal. As discussed at lunch, syncing more than 512KB at a time may help. We may be able to parallelize a bit as well if that doesn't work out, but hopefully syncing a little more at a time gets us back to a more reasonable place. It's nice that we won't have more than a gigabyte of dirty data waiting to be flushed anymore!
Syncing every 4mb was more like 117s and every 8mb got it to 101s. Interestingly, even at 8mb, I still didn't observe a dirty number over ~300kb, though I was only polling once per second.
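Collecting the restore times reported in this thread so far (2GB restore, 4 node roachprod gce cluster):

| Sync interval | Restore time |
| --- | --- |
| no fsync (master) | 36s |
| 2mb | 139s |
| 4mb | ~117s |
| 8mb | 101s |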
One theory was that we could handle more concurrency now that we're not waiting for syncs, but, at first glance, upping the concurrent import request limit just makes it slower and brings back liveness errors :/ More digging is required.
Bumping. Maybe we want a background fsync and some syncs-in-flight limit, but eh -- that feels slightly like re-inventing what the kernel is supposed to be doing, and we do actually want ssts synced before we call the restore done, which is very much not the case currently.
Well that's frustrating. You and the bulk I/O team will have to decide what sort of performance hit is acceptable. I was wondering, though: did you test with vs. without fsync for restores larger than 2GB? It'd be useful to know how the difference scales -- is fsyncing always 2.5x slower, or does it just add 60 seconds to the end of a restore, or somewhere in between?
It might be a correctness bug that the sideloaded sstables were not previously being synced. An inopportune crash could remove data. I think we have to sync these files after they are written. See also https://stackoverflow.com/questions/15348431/does-close-call-fsync-on-linux.
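Since close() doesn't imply fsync(), durably creating a file generally means syncing the file contents and then the parent directory, so the directory entry itself survives a crash. A sketch of that sequence in Go (illustrative, not the PR's code):

```go
package fileutil

import (
	"os"
	"path/filepath"
)

// syncFileAndDir fsyncs an already-written file and then its parent
// directory; on Linux the directory fsync is what persists the entry.
func syncFileAndDir(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	d, err := os.Open(filepath.Dir(path))
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}
```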
Yeah, agreed -- especially since once that restore finishes, we're happy to serve reads against those sstables, and on that 2gb restore, it looked like we still had 650mb+ reported as dirty well after RESTORE returned. So some of the "slowdown" here is just that the previous restore numbers were a little misleading, since we hadn't actually written to disk after all.

Syncing every 8mb and syncing only after writing each file looks about the same, but I still saw a few slow heartbeat warnings. 4mb and 2mb each look like they add a little more slowdown, presumably indicating that we're waiting for syncs when we could be getting more work done, but get rid of liveness complaints.

Switched it to a setting, so we can continue to play with it. RFAL.
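A sketch of what making the sync size live-tunable can look like; the atomic variable here stands in for the actual cluster setting, whose name and plumbing through pkg/settings are omitted:

```go
package fileutil

import "sync/atomic"

// syncBytes stands in for the cluster setting added in this change;
// 2<<20 is the 2mb default discussed above.
var syncBytes int64 = 2 << 20

// SetSyncBytes would be driven by the cluster setting machinery.
func SetSyncBytes(n int64) { atomic.StoreInt64(&syncBytes, n) }

// getSyncBytes is read by the write loop before each chunk, so a
// changed setting takes effect on the next write.
func getSyncBytes() int64 { return atomic.LoadInt64(&syncBytes) }
```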
According to my read of the ext4 docs—and my understanding is that all of our local SSDs are mounted as ext4—the mount options default to data=ordered.

It is of course quite possible that I'm misreading/misunderstanding something.
I'd also be curious to see if

In any case, these aren't complaints about the approach in this PR! Just musings.
@benesch just for a couple seconds... but that is still after we'd published the table descs and have promised durability.

After switching to a cluster setting, I did a trial run with syncSize = 128mb, which should mean just one sync per sst, after writing the whole file. While that was a tiny bit faster than 4 or 2mb, it was still in the ~100s range, not the <40s range of no-sync, though I was seeing some heartbeat complaints on those runs.
Oh, you meant one global
Any objections to merging this, as-is, and then investigating any improvements we want to make in followups? As-is, we're kinda lying by not syncing at all. |
No objections from me. A benchmark (or at least instructions to do the same test run you've been doing) would be nice, though.
Yeah, one sync at the very end is definitely not viable from a raft consistency perspective. It's just shocking to me that periodic fsyncs cause such a slowdown when the data in question can be flushed to disk in a few seconds.
#20352 configured rocksdb to sync every 512kb. This does the same for our sst sideload file writes.
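For illustration, a minimal sketch of the mechanism under discussion: write the sst in chunks and fsync after each one, so the kernel never accumulates a large burst of dirty pages. This is a hedged approximation, not the actual pkg/util/fileutil/syncing_write.go code; the function name and signature are assumptions.

```go
package fileutil

import "os"

// syncInterval is how much data may be written between fsyncs; the
// thread above experiments with 2mb, 4mb, and 8mb.
const syncInterval = 2 << 20 // 2mb

// WriteFileSyncing writes data to filename in syncInterval-sized
// chunks, syncing after each chunk and closing the file before it
// returns.
func WriteFileSyncing(filename string, data []byte, perm os.FileMode) (err error) {
	f, err := os.OpenFile(filename, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, perm)
	if err != nil {
		return err
	}
	defer func() {
		// Close unconditionally; report Close's error only if the
		// writes and syncs themselves succeeded.
		if cerr := f.Close(); err == nil {
			err = cerr
		}
	}()
	for len(data) > 0 {
		chunk := data
		if len(chunk) > syncInterval {
			chunk = chunk[:syncInterval]
		}
		var n int
		if n, err = f.Write(chunk); err != nil {
			return err
		}
		data = data[n:]
		if err = f.Sync(); err != nil {
			return err
		}
	}
	return nil
}
```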